There is an image with shuffled rows - http://biit.cs.ut.ee/imgshuffle/index.cgi?fname=DM2016&dname=DM2016 You can
get access to RGB values in here - http://biit.cs.ut.ee/imgshuffle/data/DM2016/DM2016.txt (uploaded file here) You can re-order row id’s in any order and upload them to the same webpage to recover image in that new order of rows. New image for those who know what is in the first image and want to have fun finding out what’s in the picture (txt file).
For this task I use Kmeans to achieve this but the image still too fuzzy to show any distinct difference from original image given.
2. Use the same data matrix from task 1 and run a PCA analysis on it. Plot first three principal components as 2-dimensional plots PC1-PC2, PC1-PC3, PC2-PC3 of these data or as a 3D plot. Check out PCA example
For this question I had some DLL issues since I am on mac and the library I want to use apparently only works on windows I couldn’t move forward with the question
census2000 = read.csv("extract_medium.csv", sep = ";", header = T)
cast(census2000, Education ~ Sex)#Sex + Education)
## Using Earnings as value column. Use the value argument to cast to override this choice
## Aggregation requires fun.aggregate: length used as default
## Education 1 2
## 1 0 485 505
## 2 1 627 620
## 3 2 1094 1035
## 4 3 689 616
## 5 4 573 655
## 6 5 337 368
## 7 6 330 366
## 8 7 360 395
## 9 8 404 429
## 10 9 1907 2186
## 11 10 603 769
## 12 11 1595 1641
## 13 12 549 608
## 14 13 1733 1808
## 15 14 576 578
## 16 15 244 155
## 17 16 99 61
#pca3d(pcadata, group=data[,1], bg = "grey")
On the same data - try to visualize other relationships in data - based on ancestry, industry, marital status and education, for example.
Read the Jim Gray - Data Cube abstraction. Describe the key operators from this article using examples based on above census data (tasks 3-4). (Alternative list of operations - https://en.wikipedia.org/wiki/OLAP_cube#Operations )
Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions.[5] The picture shows a dicing operation: The new cube shows the sales figures of a limited number of product categories, the time and region dimensions cover the same range as before.
Drill Down/Up allows the user to navigate among levels of data ranging from the most summarized (up) to the most detailed (down).[4] The picture shows a drill-down operation: The analyst moves from the summary category “Outdoor-Schutzausrüstung” to see the sales figures for the individual products.
Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule might be computing totals along a hierarchy or applying a set of formulas such as “profit = sales - expenses”.
Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could replace products with time periods to see data across time for a single product
(Bonus 2p) Attempt running a TSP or other techniques to recover as well as possible the original image of tasks 1.-2.
(Bonus 2p) Load the same census data sets (you can attempt larger ones, too) into a DB and run SQL queries to achieve summarization as in pivot tables.
USE DM;
SELECT
Sex, SUM(Earnings)
FROM
extract_medium
GROUP BY Sex;
SELECT
Age, SUM(Earnings)
FROM
extract_medium
GROUP BY Age;
SELECT
Marriage, SUM(Earnings)
FROM
extract_medium
GROUP BY Marriage;
SELECT
Sex, Education, AVG(Earnings)
FROM
extract_medium
GROUP BY Education
ORDER BY Earnings DESC;
SELECT
State, Education, Sex, Age, Earnings
FROM
extract_medium
WHERE
Earnings < ANY (SELECT
Earnings
FROM
extract_medium)
AND Sex = 1;
SELECT
State, Education, Sex, Age, Earnings
FROM
extract_medium
WHERE
Earnings < ANY (SELECT
Earnings
FROM
extract_medium
WHERE
Education < 14)
AND Sex = 1
ORDER BY Earnings DESC;
SELECT
AVG(Earnings)
FROM
extract_medium
WHERE
Sex = 1 AND Marriage = 1
LIMIT 0 , 25000;
SELECT
AVG(Earnings)
FROM
extract_medium
WHERE
Sex = 1 AND Marriage = 2
LIMIT 0 , 25000;
SELECT
AVG(Earnings)
FROM
extract_medium
WHERE
Sex = 1 AND Marriage = 3
LIMIT 0 , 25000;
SELECT
AVG(Earnings)
FROM
extract_medium
WHERE
Sex = 1 AND Marriage = 4
LIMIT 0 , 25000;
SELECT
AVG(Earnings)
FROM
extract_medium
WHERE
Sex = 1 AND Marriage = 5
LIMIT 0 , 25000;
For this task I wanted to use phpMyAdmin but I got into some issues so I decided to USE and simple SQLite browser to import the csv file as a table into my database.
Generating a table that provides a view of a group in relation to their total earnings by age group, is inferred that age any age group below 16 i.e 15 and below are not included earn anything: it could therefore be concluded that thoses are not the workforce.
Subset and summarizing data by level of marrige / total status of the state of marraige May 1 categories, one could infer that group marraige was more gains than other while state 5 also follows.
From there, people without education, which is 0 with a level of training is no gain due their average scores of profits to zero.
Then rotate the column Gender, education and Averaige actually gain understanding comparison earning next to the other in education, i.e how education influences their earnings. And so, it is understood that as education levels 15 and 16 recieves more average earnings than others, while 15 to be precise have an advantage over 16. In general, the gains Education seems parallel in its sum.
Make more complex swivel abstract state, among others, were included for a low employees are considered and its interesting to see that most of those from the state Arizona.
Then, finally, using the similar request before, but this time knowing those earning high and the less the wiht education. As previously inferred, gender 1 gets higher earnings, most if not all that comes from Carlfonia state. This could mean higher income or standard of living in this American side with large companies paysw well … and also to the above, has also noticed the The higher your level of education, the higher your winnings.
rpivotTable(data = census2000, rows = "House_relation",cols = c("Sex","Education") , rendererName = "Heatmap")
rpivotTable(data = census2000, rows = "Ancestry",cols = c("Education") , rendererName = "Heatmap")